googologywikiaorg-20200223-history
User blog:Vel!/Chomsky hierarchy
Googologists are interested in generating fast-growing functions from hypercomputational models such as oracle Turing machines, SKIO calculus, and infinite time Turing machines. In this blog post, we explore a small set of computational models that are weaker than Turing machines. As always I have LittlePeng9 to thank for various corrections to this blog post. Strings and languages An alphabet is a set \(\Sigma\); in this document we care only about finite alphabets. The elements of an alphabet are its symbols. A string in \(\Sigma\) is an ordered tuple of zero or more elements in \(\Sigma\); by the definition of ordered tuples all strings are finite. Inspired by programming conventions, we use double quotes "" to denote strings. We use \(\epsilon\) to denote the empty string. There's only one interesting operation on strings, which is concatenation. This means combining the strings end-to-end, and we notate it by simply putting the strings next to each other. For example, "pine" "apple" = "pineapple". When variables represent strings, this is confusingly the same notation as multiplication. A formal language is a set of strings, interpreted as the strings that the language considers grammatically correct. We care only about languages whose strings belong to finite alphabets, but languages can themselves be infinite. There are three operations on formal languages that we're interested in. The first is the language union \(L \cup M\), which is exactly the set union. The union of two languages consists of the strings that belong to either language. A more interesting one is language concatenation \(LM\), defined as the set of strings formed by concatenating one string from \(L\) with one string from \(M\). For example: :{"near", "far"}{"sighted", "ness"} = {"nearsighted", "farsighted", "nearness", "farness"} Using set-builder notation, we formally define \(LM = \{ab | a \in L \wedge b \in M\}\). The third operation is the Kleene star \(L^*\). \(L^*\) is defined as the language formed by concatenating any number of strings from \(L\). Since "any number" includes zero, \(L^*\) always contains the empty string \(\epsilon\) (the result of concatenating zero strings). An example of the Kleene star: :{"0", "1", "2"}* = {"", "0", "1", "2", "00", "01", "02", "10", "11", "12", "20", "21", "22", "000", "001", "002", ...} Formally, it is the smallest language containing \(\epsilon\) and all the strings in \(L\) that is closed under string concatenation. Regular grammars Formal languages themselves are not as interesting as formal grammars, which are a method for generating formal languages. An infinite list of valid C programs is not as useful as a document that completely describes its syntax. We'll start with a subclass of formal grammars called regular grammars. I'll defer a formal definition of regular grammars until a later section; for now I'll begin with an example. variables: S N E B J terminals: a b d e h j m n t u y , . ! space start symbol: S productions: "S" => "naN" "N" => "E" "N" => " naE" "E" => "J" "E" => "B" "J" => ", hey jude." "B" => " batman!" The lines containing => mean "replacement." We begin with "S" and select appropriate replacements until our string has no variables in it. Here's an example of how to generate a string from this grammar: S ("S" => "naN") naN ("N" => " naN") na naN ("N" => " naN") na na naN ("N" => " naN") na na na naN ("N" => "E") na na na naE ("E" => "B") na na na naB ("B" => " batman!") na na na na batman! (done) If we collect all the strings that can be generated by this process (start with the start symbol and pick replacements until there are no variables left), we get a set of strings. This set is the language generated by the grammar. The language, as it turns out, is :{"na batman!", "na na batman!", "na na na batman!", ..., "na, hey jude.", "na na, hey jude.", "na na na, hey jude.", ...} There is a restriction on regular grammars that is not clear from the above example. Notice that the right-hand side of each production takes on the form "or more terminalsor one variables", so productions such as "S" => "naNE!" are illegal. More obvious is the restriction that every left-hand side can only contain a single variable, so productions such as "SE" => "naNE" are illegal. A language that is generated by a regular grammar is called a regular language. An equivalent definition of a regular language is any language that can be constructed from unions, concatenations, and Kleene stars, starting with finite languages. For example, our Batman example is also :{"na"}{" na"}*{" batman!", ", hey jude."} Context-free grammars If we lift the restriction and allow productions to mix variables and terminals arbitrarily, we get context-free grammars, which generate context-free languages. Here is a fairly practical example of a context-free grammar: variables: D N E terminals: 0 1 2 3 4 5 6 7 8 9 + - * / start symbol: E productions: "D" => "0" "D" => "1" ... "D" => "9" "N" => "D" "N" => "DN" "E" => "N" "E" => "(E)" "E" => "E+E" "E" => "E-E" "E" => "E*E" "E" => "E/E" As an example, we show that this grammar can generate 1234/(4+2*86): E E/E N/E DN/E DDN/E DDDN/E DDDD/E 1DDD/E 12DD/E 123D/E 1234/E 1234/(E) 1234/(E+E) 1234/(E+E*E) 1234/(N+E*E) 1234/(N+N*E) 1234/(N+N*N) 1234/(D+N*N) 1234/(D+D*N) 1234/(D+D*DN) 1234/(D+D*DD) 1234/(4+D*DD) 1234/(4+2*DD) 1234/(4+2*8D) 1234/(4+2*86) Note that this is not the only way to derive the string. We can do some things out of order — I could have chosen to leave the 1234 as N for a while, and hold off expanding it until the very end. All regular grammars are context-free, but not all context-free grammars are regular. Our calculator example above is not regular (although proving this is beyond the scope of this blog post, as well as my skills). While the right-hand sides of the productions are now virtually unrestricted, the left-hand sides are still constrained to single variables. Context-sensitive grammars With some thought, you can make sense of the name "context-free grammar." The replacement process can look at single variables, but not their surroundings. A context-sensitive grammar allows you to replace single variables depending on context, as long as the context is left alone. "Context-sensitive grammar" is an awful name. The term appears to be the opposite of context-free, which is not at all true. See if you can figure out what this context-sensitive grammar does: variables: L S R terminals: a-z start symbol: S productions: "S" => "LR" # This is a common way to abbreviate multiple productions with the same LHS "L" => "a" | "b" | ... | "y" | "z" "aR" => "abR" | "acR" | ... | "ayR" | "azR" "bR" => "bcR" | "bdR" | ... | "byR" | "bzR" "cR" => "cdR" | "ceR" | ... | "cyR" | "czR" ... "xR" => "xyR" | "xzR" "yR" => "yzR" "zR" => "z" Example: S LR cR cfR cfgR cfghR cfghjR cfghjmR cfghjmpR cfghjmpsR cfghjmpsxR cfghjmpsxyzR cfghjmpsxyz This particular example only used the left-hand side of the context, but general context-sensitive grammars allow contexts such as "aRa" -- as long as the production rule does not alter the context in replacement. Chomsky proposes context-sensitive grammars as a model for . Unrestricted grammars Unrestricted grammars are not actually 100% unrestricted. The productions still have to contain at least one variable in their left-hand sides, otherwise you would be able to apply production rules after the process has terminated. The languages generated by unrestricted grammars are called "recursively enumerable languages." Chomsky hierarchy So far we have discussed four different types of languages: * regular (type 3) * context-free (type 2) * context-sensitive (type 1) * recursively enumerable (type 0) These four form the Chomsky hierarchy. You might already have a sense of how formal languages can be thought of as computational models. Suppose we have a decision problem over strings in an alphabet, such as "is \(S\) a string of a's whose length is a power of two?", or "is \(S\) a subset of the Latin alphabet, presented in sorted order?". Then we ask, "is there a type X language that contains a string if and only if it satisfies the decision problem?" For instance, my example context-sensitive language solves the second decision problem. We say that a Turing machine recognizes a language iff there is a TM that can correctly output when a string is in the language. If the string isn't in the language, then it can do anything other than output "yes," including going on infinitely without halting. It turns out that the recursively enumerable languages are precisely those that can be recognized by Turing machines. By using weaker analogs of Turing machines, we can make analogous statements for the other three language types. Namely: * A language is regular iff a finite state automaton can recognize it. * A language is context-free iff a nondeterministic pushdown automaton can recognize it. * A language is context-sensitive iff a linear bounded automaton can recognize it. Hacking a regular grammar It's no coincidence that "regular language" and "regular expression" share a word. Regular expressions are themselves a notation for regular grammars. The set of strings matched by a given regular expression is a regular language. I should note, however, that some regular expression implementations have a feature called "backreferencing" which can generate languages that are not even context-free. In possibly the first time computer security and googology have intersected, infosec researchers have described an attack called a Regular Expression denial of service. The ReDoS attack involves exploiting naive algorithms for regular expression matching. The heart of ReDoS is the fact that some strings in some grammars have many possible "paths" leading to them. Consider the following regular grammar: variables: A terminals: a start symbol: A productions: "A" => "a" "A" => "aA" "A" => "aaA" "A" => "aaaA" With a sufficiently long string of a's, the number of ways to generate it becomes exponential-ish w.r.t. the length. (An actual ReDoS attack involves finding a long string that isn't in the regular language, but still tricks a naive algorithm into trying numerous paths number exponential w.r.t. the string length before failing.) This gives us a fast-growing function: :Define \(\text{Regular}(n,m)\) as the largest number of paths leading to a string of length at most \(m\) according to a regular grammar with no more than \(n\) productions. Unfortunately, some regular grammars can allow arbitrarily many paths, by adding productions such as "A" => "A". We simply ignore such cases, refining our definition to the following: :Let \(G\) be a regular grammar with \(\leq n\) productions, and let \(S\) be a string of length \(\leq m\) that can be derived from the grammar via \(P\) unique paths, where \(P\) is finite. Then \(\text{Regular}(n,m)\) is defined as the maximal value of \(P\). Generalizing From here, we can define \(\text{Context-free}(n,m)\), \(\text{Context-sensitive}(n,m)\), and \(\text{Unrestricted}(n,m)\). Regular, Context-free, and Context-sensitive are computable, but Unrestricted is not. Unfortunately, it's not quite so straightforward. A context-free grammar with "A" => "BB...BB"; "B" => "" allows us to increase the number of B's arbitrarily, leading to arbitrarily large path counts. This makes the three new functions ill-defined. Capping the lengths of the production rules fixes this. (This "vulnerability" didn't affect Regular, but I have amended it anyway for consistency with the other three.) :Let \(G\) be a regular/context-free/context-sensitive/unrestricted grammar with \(\leq n\) productions, each with left- and right-hand sides with \(\leq n\) symbols. Let \(S\) be a string of length \(\leq m\) that can be derived from the grammar via \(P\) unique paths, where \(P\) is finite. Then \(\text{Regular/Context-free/Context-sensitive/Unrestricted}(n,m)\) is defined as the maximal value of \(P\). If what we want are unary functions, we can set \(n = m\) and define \(\text{Regular}(n) = \text{Regular}(n,n)\). This is very experimental. Hopefully there is no hidden relationship between \(n\) and \(m\) that causes this to fail. How fast do they grow? I'm not sure. I suspect that, if nondegenerate, Regular grows at \(n^n\)-ish and Unrestricted outgrows all recursive functions. I don't think the two middle functions are particularly strong. LittlePeng9 conjectures that Context-sensitive is only doubly exponential. "All this work for a meager double exponential function?!" you exclaim. Maybe the first three functions won't be beating any records, but I still find this an interesting study in extracting fast-growing functions from the study of formal languages. Loose ends The definition I gave for regular grammars is a little different from the typical definitions — I expanded it for the sake of a readable example. The more standard definition is that each right-hand side is either a single terminal, a single terminal followed by a single nonterminal, or the empty string. The two definitions generate the same set of languages — the regular languages. In the Chomsky hierarchy section, the definition of "recognizability" simply requires that Turing machine outputs "yes" iff the given string is in the language. It need not output "no" when the string is not in the language; it can even go on infinitely. If we require that the Turing machine outputs "no," the language produced belongs to a new class called the recursive languages or decidable languages. These are not in the Chomsky hierarchy, but if they were, they would be between types 0 and 1. Unfortunately, I know of no simple definition for "decidable grammars" (aside from artifically deriving one from the definition of decidable languages). Here's a quick exercise for you. We say that a language is listable iff there is a TM that, given blank input, eventually outputs all the strings in that language. Show that a language is listable iff it is recursively enumerable. Formal definition A formal grammar is a quadruple \(G = (N, \Sigma, P, S)\) where \(N\) and \(\Sigma\) are finite, disjoint sets, \(P \subseteq TNT \times T\) where \(T = T(G) = (N \cup \Sigma)^*\), and \(S \in N\). We define the binary relation \(\Rightarrow_G\) on \(T\) as follows: \\Rightarrow_G B \Leftrightarrow \exists u, v, p, q \in T: A = upv \wedge B = uqv \wedge (p,q) \in P\ For \(H \in T^+\), we say that \(\text{Path}_G(A, B, H)\) iff the first element of \(H\) is \(A\), the last element of \(H\) is \(B\), and every consecutive pair \((C,D)\) of elements in \(H\) has \(C \Rightarrow_G D\). We define \(\text{Paths}_G(A) = \#\{H : \text{Path}_G(U, A, H)\}\) where \(U\) is a singleton string containing only \(S\). We define \(\text{MaxPath}_G(m)\) as the maximal finite value of \(\text{Paths}_G(A)\) for all \(A \in \Sigma^{\leq m}\). A formal grammar \(G\) is regular iff \(P \subseteq N \times (\{\epsilon\} \cup \Sigma \cup \Sigma N)\). A formal grammar \(G\) is context-free iff \(P \subseteq N \times T\), and context-sensitive iff for every \((r, \ell) \in P\), there exists \(p \in N\) and \(q, u, v \in T\) such that \(r = upv\) and \(\ell = uqv\). A formal grammar \(G\) is called \(n\)-limited iff \(\#(P) \leq n\) and for every \((r, \ell) \in P\), the lengths of \(r\) and \(\ell\) are \(\leq n\). We define \(\text{Regular}(n, m)\) as the maximal value of \(\text{MaxPath}_G(m)\) for regular \(n\)-limited \(G\). We define \(\text{Context-free}(n, m)\) as the maximal value of \(\text{MaxPath}_G(m)\) for context-free \(n\)-limited \(G\). We define \(\text{Context-sensitive}(n, m)\) as the maximal value of \(\text{MaxPath}_G(m)\) for context-sensitive \(n\)-limited \(G\). We define \(\text{Unrestricted}(n, m)\) as the maximal value of \(\text{MaxPath}_G(m)\) for \(n\)-limited \(G\). Category:Blog posts